
🐛 Do not update MS status when unable to get workload cluster or machine node #10436

Closed
wants to merge 1 commit

Conversation

@jessehu (Contributor) commented Apr 15, 2024

What this PR does / why we need it:
The ErrClusterLocked error causes MD.Status.ReadyReplicas to change from 3 to 0, and after about 90s it changes back to 3. The reason is that updateStatus() in machineset_controller.go ignores the error returned by getMachineNode() and treats the Node as not ready. Dropping MD.Status.ReadyReplicas from 3 to 0 is unreasonable and causes issues in our project built on top of CAPI.
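For context, a self-contained toy sketch of the pre-fix counting behavior when the ClusterCacheTracker is locked. This is not the actual CAPI controller code; getNodeReady, readyReplicas, and errClusterLocked are illustrative stand-ins for getMachineNode(), the updateStatus() loop, and ErrClusterLocked.

```go
package main

import (
	"errors"
	"fmt"
)

// errClusterLocked stands in for CAPI's ErrClusterLocked (illustrative only).
var errClusterLocked = errors.New("cluster is locked already")

// getNodeReady simulates getMachineNode plus the readiness check; while the
// cluster cache is locked it fails for every Machine, even though the Nodes
// themselves are healthy.
func getNodeReady(machine string, clusterLocked bool) (bool, error) {
	if clusterLocked {
		return false, errClusterLocked
	}
	return true, nil
}

// readyReplicas mirrors the pre-fix counting logic: a lookup error is only
// logged and the Machine is skipped, so it is never counted as ready.
func readyReplicas(machines []string, clusterLocked bool) int32 {
	var ready int32
	for _, m := range machines {
		ok, err := getNodeReady(m, clusterLocked)
		if err != nil {
			fmt.Printf("unable to retrieve Node status for %s: %v\n", m, err)
			continue // pre-fix behavior: swallow the error and drop the replica
		}
		if ok {
			ready++
		}
	}
	return ready
}

func main() {
	machines := []string{"machine-0", "machine-1", "machine-2"}
	fmt.Println(readyReplicas(machines, false)) // 3
	fmt.Println(readyReplicas(machines, true))  // 0: ReadyReplicas would drop to 0
}
```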

Which issue(s) this PR fixes:
Fixes #10195

Changes
Do not update any fields of MS.Status when the workload cluster or the machine Node cannot be retrieved, whether due to ErrClusterLocked or any other error. ErrClusterLocked is expected to resolve shortly after reconciling again, and an error retrieving the machine Node (e.g. a network issue or a temporarily unavailable apiserver) should likewise recover shortly after reconciling again.
Please refer to the discussion on #10229 (comment) for more context.

Test:

  • The existing unit tests for the MS.Status update logic do not set up a fake Reconciler.Tracker object and do not cover the getMachineNode logic, so adding a unit test for this error scenario requires further changes.
  • Ran an e2e test that creates a cluster with 3 control-plane and 3 worker nodes and then upgrades the CAPI component versions (based on CAPI 1.5.2) after the cluster is created, and observed the following expected behavior:
    • MD.Status.ReadyReplicas did not change from 3 to 0 when getMachineNode() hit the ErrClusterLocked error; the log "Requeuing because another worker has the lock on the ClusterCacheTracker" was printed.

@sbueringer @fabriziopandini please kindly take a look again. Thanks for your time!

@k8s-ci-robot added the labels cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and do-not-merge/needs-area (PR is missing an area label) on Apr 15, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

Hi @jessehu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the labels needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and size/S (denotes a PR that changes 10-29 lines, ignoring generated files) on Apr 15, 2024
@jessehu (Contributor, Author) commented Apr 15, 2024

/area machineset

@k8s-ci-robot added the label area/machineset (issues or PRs related to machinesets) and removed the label do-not-merge/needs-area (PR is missing an area label) on Apr 15, 2024
@sbueringer (Member)

/ok-to-test

@k8s-ci-robot added the label ok-to-test (indicates a non-member PR verified by an org member that is safe to test) and removed the label needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) on Apr 15, 2024
```diff
@@ -883,8 +888,7 @@ func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster
 		node, err := r.getMachineNode(ctx, cluster, machine)
 		if err != nil && machine.GetDeletionTimestamp().IsZero() {
-			log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
-			continue
+			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
```
@sbueringer (Member) commented Apr 15, 2024

I think this doesn't address:

> For MachineSet specifically: It definitely solves the "happy path", but it makes the "unhappy path" worse by freezing the status indefinitely. Without the changes the ready and available replicas were dropping to 0 when the communication broke down. Probably a matter of preference if in case of the communication breakdown we prefer to keep ready and available replicas as is or drop them to 0. But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions)

Specifically:

> "unhappy path" worse by freezing the status indefinitely

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions)

@jessehu (Contributor, Author)

Thanks @sbueringer.

> "unhappy path" worse by freezing the status indefinitely

This PR does not update any fields of MS.Status when the workload cluster or the machine Node cannot be retrieved, whether due to ErrClusterLocked or any other error, because ErrClusterLocked recovers soon after reconciling again, and an error retrieving the machine Node (e.g. a network issue or a temporarily unavailable apiserver) should also recover soon after reconciling again. Is there another "unhappy path" you meant?

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

Since this PR returns an error at line 891, it won't update ms.Status at line 904 or newStatus.ObservedGeneration at line 924. This is as expected.

(screenshots of the referenced lines in machineset_controller.go)
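To make that flow concrete, a self-contained toy sketch of how returning early skips every status field, for better and for worse. This is not the actual controller code; the struct, function, and line-number comments are illustrative stand-ins for the lines referenced above.

```go
package main

import (
	"errors"
	"fmt"
)

// msStatus is a trimmed stand-in for MachineSetStatus (illustrative only).
type msStatus struct {
	ReadyReplicas      int32
	ObservedGeneration int64
}

var errClusterLocked = errors.New("cluster is locked already")

// updateStatus mirrors the post-fix flow: if the Node lookup failed, the error
// is returned before any status field is touched, so the equivalents of the
// assignments at lines 904 and 924 in the real controller are never reached.
func updateStatus(status *msStatus, generation int64, readyCount int32, lookupErr error) error {
	if lookupErr != nil {
		return fmt.Errorf("unable to retrieve the status of Node: %w", lookupErr) // ~line 891
	}
	status.ReadyReplicas = readyCount      // ~line 904
	status.ObservedGeneration = generation // ~line 924
	return nil
}

func main() {
	status := msStatus{ReadyReplicas: 3, ObservedGeneration: 7}
	_ = updateStatus(&status, 8, 0, errClusterLocked)
	// Status is left untouched, including ObservedGeneration, which is the
	// side effect @sbueringer points out above.
	fmt.Printf("%+v\n", status) // {ReadyReplicas:3 ObservedGeneration:7}
}
```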

@sbueringer (Member)

> an error retrieving the machine Node (e.g. a network issue or a temporarily unavailable apiserver) should also recover soon after reconciling again. Is there another "unhappy path" you meant?

Yup, the apiserver being down for a longer period of time / indefinitely.

@jessehu (Contributor, Author)

If the apiserver goes down for a longer period of time or indefinitely, the KCP controller should catch that and handle it correctly.

@sbueringer (Member)

KCP won't always be able to recover (if that's what you meant)

@jessehu (Contributor, Author) commented May 9, 2024

Hi @fabriziopandini, as discussed with @sbueringer in the current thread, IMHO this PR provides a simple patch that solves the MS status update issue described in #10195, and there is no need to introduce the suggested handling code (provided by @sbueringer and, IMHO, a little complex):

  1. I replied with my thoughts in 🐛 Do not update KCP and MS status when unable to get workload cluster #10229 (comment) and also removed the KCP status handling code as you suggested.
  2. If the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch that and handle it correctly. In that case, updating the MS status won't help.

BTW, I was on vacation last week, so sorry for the late reply. Thanks!

A Member commented:

I will look into it as soon as possible, but we have already invested a lot of time in trying to find an acceptable way forward.

@sbueringer (Member) commented May 13, 2024

I tried to express above that it's not acceptable for the MS status to permanently stop being updated when a workload cluster stays unreachable.

@jessehu (Contributor, Author)

As I described:
2) If the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch that and handle it correctly. In that case, updating the MS status won't help.
3) Even if we want to update the MS status in this special case, it would not be good to set the MS ready replicas to 0; instead, IMHO, a new status field such as UnknownReplicas should be added (but that changes the current API definition and is not a trivial change). A hypothetical sketch follows below.
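For illustration only, a hypothetical sketch of the kind of API addition point 3) alludes to. UnknownReplicas is not part of the current CAPI API, and the existing fields shown are trimmed to the two relevant ones.

```go
// Hypothetical sketch only: UnknownReplicas does not exist in the CAPI API
// today, and the surrounding fields are trimmed for brevity.
package v1beta1

type MachineSetStatus struct {
	// Replicas is the most recently observed number of replicas.
	Replicas int32 `json:"replicas,omitempty"`

	// ReadyReplicas is the number of ready replicas for this MachineSet.
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`

	// UnknownReplicas would count replicas whose Node status could not be
	// determined (e.g. while the workload cluster is unreachable) instead of
	// reporting them as not ready.
	UnknownReplicas int32 `json:"unknownReplicas,omitempty"`
}
```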

A Member commented:

Sure, updating the MS status won't contribute to KCP trying to remediate the problem, but it will actually keep the status up to date. The status is much more than just the replica fields.

@k8s-ci-robot added the label needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 17, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the label lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale) on Aug 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the label lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the label lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale) on Sep 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages PRs according to the following rules:
>
> • After 90d of inactivity, lifecycle/stale is applied
> • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
> • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
>
> You can:
>
> • Reopen this PR with /reopen
> • Mark this PR as fresh with /remove-lifecycle rotten
> • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
area/machineset Issues or PRs related to machinesets cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.